perm filename DRAFT.1[AP,DBL]3 blob sn#116989 filedate 1974-08-30 generic text, type T, neo UTF8
00100	PUP6 Paper   			First Draft
00200	
00300	1. OUTLINE
00400		Outline
00500		Abstract
00600		Introduction
00700		Ideas
00800		Implementation
00900		Examples
01000		Targets
01100		Performance
01200		Conclusions
01300	
01400	2. ABSTRACT
01500	A system has been implemented which can synthesize large inductive inference
01600	programs from restricted English dialogues.  All knowledge and all
01700	control resides in highly structured pieces of code called BEINGS. Each
01800	BEING has a similar structure, and may therefore be viewed as an extension
01900	of the concept of ACTORS [Hewitt, 1973].  The output code of the system is
02000	also in the form of BEINGS, hence the synthesized program can answer
02100	questions about itself as it runs.   A general description of the
02200	system which realized these ideas is provided, and its target domain is
02300	discussed.  Some unexpected problems, and some unexpected rewards, were
02400	encountered.  Various measures of performance of Automatic Programming 
02500	Systems are proposed, and we compare the current system to previous ones.
02600	We conclude by pooling our ideas into a "view" of Automatic Programming,
02700	and mention future plans.
02800	
02900	2. INTRODUCTION
03000		In this paper we report on a Program-Understanding Program (PUP6)
03100	capable of generating large LISP programs. The methods employed are not
03200	formal, but rather involve structuring of knowledge about programming, about
03300	the task domain, and about transfer of control.  To date, PUP6 has
03400	synthesized three significantly different large (30-page) target programs:
03500	a Winston-like concept formation program [Winston,     ], a grammatical
03600	inference program, and an airline reservation system.  Extended dialogs
03700	are carried on with the user, in a small subset of English, in which the
03800	task is delineated and questionable details are discussed.  Although the
03900	details of these (dialogues and final programs) are the chief concern of
04000	the user, we assume that the reader is interested in studying the ideas
04100	PUP6 embodies, and how they are implemented.  So our treatment will follow
04200	these lines: First, the ideas are presented, including a discussion of
04300	BEINGS.  Next we describe the implementation, and explain our choice of
04400	targets. Only then will we mention performance. Finally, we relate this
04500	work to the field of Automatic Programming.
04600	
     

00100	3. IDEAS
00200		First, we resolve the uniformity vs. structure controversy.  The
00300	benefits of the former include easy addition of knowledge [Newell, l973]
00400	and simple methods for combining information [McCarthy,    ].  Structure,
00500	however, is necessary for efficient handling of large amounts of data. In
00600	PUP6, we integrate these two ideas into the concept of BEINGS.  A BEING is
00700	a collection of about thirty little bits of LISP code; the answers to thirty
00800	questions about the BEING. That is, we view a small program as equivalent
00900	to its answers to these fixed questions. Every scrap of knowledge, every bit of
01000	control structure should be encoded into BEINGS. There is nothing else in
01100	the system but this interacting community.  Notice that while each BEING is
01200	highly structured, this structure is uniform over the entire community. Each
01300	BEING part is itself a little BEING, etc., and we stop this infinite regress
01400	when the contents of the BEING part becomes meaninglessly primitive. Each
01500	BEING is cognizant of the set of thirty questions, and in answering one of
01600	them it may freely ask quesions of other BEINGS (often through nondetermi-
01700	nistic goal statements.)  The reader may glance below at the particular set
01800	of questions used, but we shall discuss our other ideas next.
01900		Since only BEINGS exist, all our code must be written as BEINGS, and
02000	must be written by BEINGS.  A crucial consequence is that _some_ BEINGS must
02100	write code. Our new idea is that the BEING who knows about X takes charge of
02200	generating code relating to X. For example, the SORT BEING can do sorting, and
02300	he can also write specialized sort routines, and he can answer questions about
02400	sorting.  A second consequence is that the output code will have all the
02500	"intelligence" that any BEING community has: it can write variations of itself,
02600	modify itself according to the user's complaints, and answer questions about
02700	what it is doing as it runs.
02800		In a similar vein, _some_ BEING(s) must do the translation of the 
02900	users' quasi-English inputs into BEING-usable form.  We choose to have each
03000	BEING X recognize English related to X.  Thus translation consists of
03100	querying "who can recognize ..." and waiting for a response. For example, 
03200	our  SORT BEING must also recognize and process phrases involving sorting or
03300	ordering.
03400		Three more ideas are present, constraints which reflect the author's
03500	philosophy:  one need not feel any reverence toward the anthropomorphic
03600	paradigm of programming (ignore details, catch them by blind debugging.)
03700	As in all programming, decisions continually crop up which the system is 
03800	not able to resolve at the time. We have the system spend a significant
03900	effort deferring the decision as long as possible. When, at last, no
04000	progress can be made without its resolution, if the system is still
04100	unsure, either the user settles the question or else a backtrack point is
04200	reluctantly set up.  If there are two or more decisions which can no longer
04300	be deferred, we tackle first the one estimated to be the easiest  [see
04400	Dijkstra et al, 1972].  The final bit of philosophy is that most of the
04500	carelessness bugs can be eliminated through this deferral and through
04600	very precise record-keeping.  Humans depend on their adaptability to 
04700	compensate for limitations in their brain hardware [see Newell and Simon,
04800	1973] but there is no need for an _automatic_ programming system to do so.
     

00100	4. IMPLEMENTATION
00200		The realization of any large system is worth study, since there
00300	seems to be a limit to the size of a program before it becomes unmanageable
00400	[Dijkstra et al, 1972] for humans.  Since PUP6 is 200 pages long, yet took
00500	only hundreds of man-hours, we shall describe how it was created.
00600		After the target program (concept formation) was chosen, it was
00700	rewritten cleanly, as a structured program [Gadwa, l973].  Next a complete
00800	dialogue was simulated between the user and a human programmer
00900	who played the role of a reasonable AP system, similar to studies by
01000	Balzer [197 ].  The dialog was then annotated: after each user response,
01100	comments were inserted which described the "states" the system-player had
01200	gone through before printing his next response.  This included blind paths
01300	which were tried, use of outside world knowledge, and, in general, was meant
01400	to be the "intelligence" necessary to do the task.  Our fear was that a
01500	system could be built which synthesized the concept formation program, and
01600	yet was so unintelligent that we learned nothing from it.  We hoped that
01700	any system which (i) got the target program correctly, (ii) followed our
01800	initial dialogue, and, most importantly, (iii) went through the same line
01900	of reasoning before responding that our comments indicated the human
02000	followed,  would be far along the road toward intelligence.
02100		At this point, we developed most of the ideas described in the
02200	preceding section, and chose a rough initial set of BEINGS. Each one had
02300	not much more than a name and a vague description of what it must do.
02400	The dialog was gone through again, painstakingly, and the comments were
02500	replaced by a description of which beings would call which other beings,
02600	why, and the results of each such call.  Our contraints on each being
02700	thus grew, sometimes changed, and occasionally a new BEING or BEING part had
02800	to be added to the community. This process was long (200 hours) and produced
02900	a long (200 pages) protocol, actually a hand trace of the system execution.
03000	About seventy BEINGS were needed, a dozen of which we viewed as domain-
03100	specific and the rest of which were domain-independent programming
03200	knowledge.  About thirty different BEING parts were called for, and they
03300	are listed in Appendix 1.  The next appendix describes each BEING briefly;
03400	only the most important knowledge is mentioned.
03500		A result of our method of incremental specification of BEINGS is
03600	that each BEING has only those parts which are going to be used sometime
03700	during the ensuing dialogue.  Many presentations of the resultant data are
03800	possible; in Appendix 1, for each BEING part, we give the percentage of
03900	BEINGS which had to have this part specified for them. In Appendix 2,
04000	for each BEING, we mention the number of of its parts we specified. On
04100	the average, each BEING part was present in   % of all BEINGS, and, also
04200	on the average, each BEING had     of its 30 parts specified.
04300		During the programming, 100 small functions were written to supple-
04400	ment the language. Most were typical current AI language features (pattern-
04500	matching, special data types) or tiny additions to INTERLISP (flatten,
04600	set-difference) or functions which manage BEINGS (add-a-new-being,
04700	print-a-being's-parts).  The initial coding took only a hundred hours, but
04800	several times that amount of work was required before the system was
04900	debugged to the point of successfully following the annotated dialogue.
     

00100	5. EXAMPLES
00200		Now we shall present examples of the following: programming
00300	knowledge BEING,  concept formation knowledge BEING,  and a description
00400	of a piece of the user-PUP6 dialogue.
00500		The OBTAIN_USABLE_INFORMATION  BEING is a typical, high-level,
00600	domain-independent creature.  The WHEN part informs us that this
00700	is always undesirable (-10) but may be indicated (+111) if there exists
00800	new but not yet usable information.  The META-CODE has us choose from the
00900	following four alternatives, each of which is itself a BEING:  translate
01000	something, get brand new information, analyze the implications of some
01100	existing information, extract a small, relevant subset of the existing
01200	information to concentrate upon.
01300		To PARTITION_A_DOMAIN in an inductive manner, we must make a few
01400	choices. The partition may be only partial, it may be only weak, and, most
01500	importantly, the BEING should partition by doing only some of these:
01600	accept a class-name and get some of its elements, accept an element and
01700	get its class-name, accept <element, class-name> pairs. Other beings know
01800	about each of these alternatives. During the partitioning, the only new 
01900	demon to keep active is the one called fringe-of-conciousness.
02000		The dialogue begins by PUP6 asking the user for any task. The user
02100	replies, "Write a program which does concept formation." There are many
02200	decisions that PUP6 knows about in writing a specialized concept formation
02300	program, and it manages to defer them all.  For example, it must choose
02400	between classificatory, comparitive, and metrical concept formation. Since
02500	all three involve partitioning a domain of examples, PUP6 decides it can
02600	defer this choice until after code for the partitioning is synthesized.
02700	Hours later, the system asks the user to make this decision, he picks the
02800	first alternative, which involves only partitioning, and the system prints
02900	that it has finished the task!  
03000		About a third of the way into the dialogue, the system learns it must
03100	compare the input scene against its internally stored models of concepts,
03200	until it finds one which doesn't fail the comparison. It asks the user
03300	what it means for scene S to fail the comparison to model M. The user 
03400	replies, "M is incompatible with the observed features of S."
03500	The IDEN part of the CONTRADICTS BEING recognizes this sentence pattern,
03600	and transforms it to (FORSOME F IN M_FEATURES (CONTRADICTS F S_FEATURES)).
03700	The BEING which inserts this into the synthesized code requires that the
03800	user pick some proper name for the function CONTRADICTS; say the user
03900	types in #.  The function FORSOME is so primitive that no specialized
04000	version of it is written normally.  The system informs the user of where
04100	it is by means of a cursor pointing into a graph of program code. This is
04200	analogous to the textual and dynamic indices which  Dijkstra suggests.
04300	Later in the dialogue, PUP6 decides it must expand the function #.  The
04400	being CONTRADICTS is again called on, this time  being asked how to write
04500	a specialized version of a contradiction test.  It replies that SOME_OF
04600	the following types of contradictionmay occur: PROBABILITY:0,
04700	PROBABILITY:1, and PROBABILITY:ε(0,1).  The SOME_OF BEING takes control,
04800	and asks if the decision can be deferred.  The deferral being looks about,
04900	first asking if there is any non-null piece of code that all three have
05000	in common. This fails, and eventually the deferral being reports failure.
05100	SOME_OF  sees it has no basis upon which to form a guess, and asks the user.
05200	ASK_USER takes over, and quickly finds out what it is that PUP6 wants to
05300	learn. The names of the three choices are so cryptic, that their WHAT and
05400	HOW parts are printed out to the user, as well as their names.  The user
05500	types back his choices, in our case all three.  SOME_OF composes three new
05600	function calls, separated by a conditional:
05700	  (COND ( (MEMBER             F   PROBABILITY:0_PART_OF_M_FEATURES)
05800		  (PROBABILITY:0_#    F   S_FEATURES))
05900	        ( (MEMBER             F   PROBABILITY:1_PART_OF_M_FEATURES)
06000		  (PROBABILITY:1_#    F   S_FEATURES))
06100	        ( (MEMBER             F   PROBABILITYε(0,1)_PART_OF_M_FEATURES)
06200		  (PROBABILITYε(0,1)_#  F  S_FEATURES))
06300	A trichotomy demon recognizes this as a split on a function's values being
06400	=0, =1, or between zero and one.  It asks whether this particaular function
06500	can only range inside the interval [0,1].  Probability answers affirmatively
06600	so SOME_OF replaces the final test by the constant T. Later on, PUP6 must
06700	worry about the other two tests, and about the three contradiction
06800	predicates.  These guys know that they are immediately encodable; a
06900	probability=0 contradiction means that arg1 must not occur in the world arg2
07000	yet it does. So it is defined as (MEMBER arg1 arg2).  A probability=1
07100	contradiction occurs when a feature arg1 must occur in the world arg2, yet
07200	it doesn't. SO its definition is simply (NOT (MEMBER arg1 arg2)).   It is
07300	impossible for a feature with probability in (0,1) to be in contradiction
07400	with any world, so this third predicate is defined as the constant NIL.
07500	IS_OF_TYPE recognizes that it must replace each of the two case tests,
07600	unless M_FEATURES is organized permanently into three parts.  The structure-
07700	inducing being will take over.  He finds nothing about this tripartite
07800	structuring which could damage any earlier synthesized code, and asks the
07900	user's blessing.  Notes are added to the list of coding warnings that
08000	any reference to the entire list of M_FEATURES must be replaced by either
08100	APPEND of the three new lists, or else by three separate statements. 
08200	GET_NAME is indirectly called, and he has the user name the three new
08300	lists of features; say he responds by calling them MUSTNOT, MUST, and MAY.
08400	The system would at this point draw an arrow on its graph of code, from
08500	the function call (# F S_FEATURES) to the new block of code:
08600	  (COND ( (MEMBER        F   MUSTNOT_LIST_OF_M)
08700		  (MEMBER        F   S_FEATURES))
08800	        ( (MEMBER        F   MUST_PART_OF_M)
08900		  (NOT (MEMBER   F   S_FEATURES))
09000	        (  T (COMMENT THIS "T" IS OK IN PLACE OF "MEMBER F MAY_PART_OF_M")
09100		   NIL))
09200	Notice that only a few messages have passed back and forth during
09300	all this processing; this shows the utility of having an annotated dialogue
09400	to compare against the actual trace. The user has the feeling of merely
09500	directing, constraining, and watching the activites of a busy programmer.
     

00100	6. TARGETS
00200		Considerable attention was spent on choosing an appropriate domain
00300	of target programs.  We now discuss our choice:  inductive inference.
00400	The obvious difficulty appears to be the size of the programs involved: they
00500	are two orders of magnitude larger than previously synthesized programs. But
00600	there are four big attractions:
00700	(i) A wide range of complexity exists, from a one-page sequence extrapolator
00800		to Meta-Dendral.
00900	(ii) Each increasingly sophistocated inductive inference (II) program uses
01000		many of the concepts embodied in simpler II programs.
01100	(iii) If a super-human target program is ever written, it could itself
01200		contribute to the field of Automatic Programming!
01300	(iv) Since we are the "experts" on inductive inference ourselves, our
01400		reliance on introspection is as valid -- and potentailly as
01500		valuable -- as Dendral's chemists' protocols.
     

00100	7. PERFORMANCE
     

00100	8. CONCLUSIONS
     

00100	9. ACKNOWLEDGEMENTS
00200	
00300		There are, of course, countless ideas embodied in any concrete
00400	project.  Sweeping philosophical assumptions are made simply in trying to
00500	do Automatic Programming [McCarthy, l9  ].   Any Program-Understanding
00600	Program should include the best parts of all the best old ideas.
00700	We rely on concepts gleaned from Actors [Hewitt, l973], Demons [Charniak,
00800	1973], heterarchy [Reddy, l973], structured programming [Dijkstra, l973],
00900	assertional data bases and flexible data types [Sacerdoti, 1973], pattern-
01000	directed invocation of procedural knowledge [Winograd, l972], the paradigm
01100	of dialogue [Floyd, 1972], and studies on program specification techniques
01200	[Green, l974].  Of course this list is incomplete; sophistocated ideas are
01300	captured in the languages themselves: QLISP [Sacerdoti], INTERLISP 
01400	[Teitelmann, l97 ], LISP [McCarthy, l9  ], and English [Anon.].  To this
01500	rich store, one may acheive new heights merely by synergy.
01600		The success of the BEINGS project, as well as the precursor PUP
01700	programs [Green et al., 1974] is due in large measure to the encouragement,
01800	advice, support, and creative enrgy of C. Cordell Green.  I have also drawn
01900	upon discussions with the SAIL Auotmatic Programming Group, and in par-
02000	ticular with R. Waldinger, D. Barstow, B. McCune, D. Shaw, and L. Steinberg.